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ABSTRACT 

In this paper we show how to reorder the computations in the SOR 
algorithm to maintain the same asymptotic rate of convergence as the rowwise 
ordering and to obtain parallelism at different levels. A parallel program is 
written to illustrate these ideas and actual machines for implementation of 
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1. Introduction 

The effective implementation of a numerical method on a vector computer requires the computations to be 
ordered to allow vector operations. On computers with multiple processing elements, or with multiple instruction 
streams, the operations must be ordered in a way that permits them to be divided into groups that can execute in 
parallel with minimal communication. Since the ordering that results in either case is usually different from an 
ordering that would be used when implementing the method on sequential computers, we say that the operations 
have been reordered to achieve parallelism. (See Voigt[1985] for a description of four paradigms for achieving 
parallelism). 

Using reordering of operations as a paradigm for obtaining parallelism leads to two major problems if not 
properly controlled. Firstly, we are unlikely to develop an ordering that is best for every new parallel computer 
that is built or porposed. Instead, we would like to find an ordering that permits us to express through the seman- 
tics of a parallel programming language the levels of parallelism we wish to obtain. Then, if this language can be 
effectively implemented on the actual architecture, our algorithms will be portable. Secondly, ordering the compu- 
tations m different ways can change the mathematical properties of the algorithm. For example, convergence rates 
of iterative methods or roundoff error of gaussian elimination may be different when the computations are done in 
different orders. Hence, we would like orderings that satisfy some mathematical rule as well as provide parallel- 
ism. 

In section 2, a procedure is demonstrated for finding an ordering for the SOR (Successive Overrelaxation) 
method that yields the same asymptotic convergence rate as an ordering that is often used on sequential comput- 
ers. In section 3, we show how to group the operations and how to store the data within groups to achieve paral- 
lelism at two different levels. In section 4, we use PISCES Fortran, see Pratt[1985], to code the algorithm and 
hence express the parallelism that was described in section 3. This PISCES program is used to show the parallel- 
ism that may be obtained in the communication as well as the arithmetic in the algorithm. But just as important, 
this program points out the operations that must be or are currently being performed in sequence, thereby indicat- 
ing the overhead costs or areas for improvement in the algorithm. In section 5, we indicate what architectures our 
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algorithm can be implemented on by simply interpreting what the PISCES semantics would mean on these architec- 
tures Hie hope is to convince the reader by writing only one program that the ordering, the grouping of opera- 
tions, and the storage strategy within each group that was given in section 3 can be effectively implemented on a 
variety of architectures. 


2. A Parallel Ordering for SOR 

Suppose the system of linear equations, 


Au=tt 


( 1 ) 


with A symmetric and positive definite, has arisen from a discretization of an elliptic partial differential equation 
on a rectangular domain with the 9-point stencil as shown in Figure 1. 


XXX 
X o o 

X o o 

X o o 

XXX 

Figure 1 . 



x ia boundary node 


For simplicity assume there is one unknown per grid node and that Dirichlet boundary conditions are imposed. 
We wish to solve system (1) by the SOR method. 

The first job is to order the unknowns at the nodes in Figure 1 to indicate which nodes must be updated 
before others. An "ordering'* implies that the nodes must update sequentially. So instead, we would like to color 
the nodes, see Adams and Ortega[1982], so that nodes of the same color may update simultaneously. Then, the 
"ordering” is by colors with nodes of a given color ordered arbitrarily. This implies that two nodes can not be the 
same color if they are on the same 9-point stencil. It is easy to show that for this stencil, only three distinct four- 
color topologies can be used to color the nodes. These are shown in Figures 2a, 2b, and 2c below (the color pat- 
tern repeats beyond the region shown) 
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6 0 R B 

R B 6 O 
G O R B 
R B G 0 

Figure 2a. 


O B O B 
G R G R 
BOBO 
R G R G 

Figure 2b. 


G O G 0 
R B R B 
G O G O 
R B R B 

Figure 2e. 


For each figure above, we can order the colon in 41=24 ways. Hence, there are 72 different orderings we could 
choose, and the asymptotic convergence rates may be different for some of these orderings. 

To aid in choosing an ordering, we require our ordering to have the same asymptotic convergence rate as a 
baseline ordering; for example, the rowwise ordering of the domain -- bottom to top, left to right. This rowwise 
ordering is depicted by the stencil rule in Figure 3. 



Figure 3. Stencil Rule for Rowwise Ordering 


This figure indicates that a node may not be updated on iteration k + 1 until all nodes on the same stencil to the left 
and below have been updated on iteration Jfc+1 and that values from iteration k must be used for nodes on the 
same stendl to the right and above. 

Adams and Jordan[1984] give a systematic procedure for finding a particular 4-color ordering for this stencil 
that gives the same asymptotic convergence rate as the rowwise ordering. This strategy was also shown to produce 
multicolor orderings for many other stencils as well. The idea is to apply the stencil rule of Figure 3 to the grid in 
Figure 1 but permit nodes to update on subsequent iterations as soon as the ap p r o p riate data is available. This 
strategy leads to a sequence of update times for each node as shown in Figure 4. 
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Figure 4. R/B/G/O Coloring and Ordering 

Figure 4 consists of three distinct sets of nodes. Those in the first set are on the third iteration while those in the 
second set are on the second iteration while those in the third set are on the first iteration. We simply color all 
nodes with the same sequence of times the same color, and order the colors by the update times in the first set; 
that is, R/B/G/O are colors 1/2/3/4 respectively. On parallel computers, we will not process by the update times in 
Figure 4, but instead, the R nodes will be updated across the entire domain, followed by the B nodes, the G 
nodes, and finally the O nodes This ordering is one of the 24 orderings depicted in Figure 2a. In the next sec- 
tion, we show that this ordering permits us to group the operations and store the data within group to achieve 
parallelism at two different levels. 

3. Grouping Nodes to Tasks and Task Data Storage 

The nodes with the coloring shown in Figure 4 must be grouped into tasks that can run simultaneously. It is 
desirable to have an equal number of each color of nodes in each task so that all tasks can update nodes of the 
same color simultaneously. It is also desirable for programming considerations to have the same color pattern m 
each task. Figure 5 shows the unknowns grouped into 3 rows and 3 columns of tasks. Each task has a unique 
number (it,jt) as shown to indicate it is in the it—th task row and the jt—th task column. 
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( 2 , 1 ) 

xB G O R B 

xO R B G 0 

xB G O R B 

xO | R B G 0 

HTT) 

xB G 0 R B 

xO R B G O 

xB G 0 R B 

xO R B G O 

Figure 5. Grouping of Nodes to Update Tasks 

Next, we must decide upon a storage scheme for the unknowns, it, the right hand side h, and hence the coefficient 
matrix A, within each update task. We want to insure that a task can run on a vector computer as well as a 
sequential one because in the near future we are likely to have multiple processor systems for which each processor 
is a vector computer. Figures 6a, and 6b show two possible storage schemes for the unknowns in the (2,2) task of 
Figure 5 that permit vector operations within a task. 




B G 0 R B G 

1 2 3 4 5 6 

Figure 6a. Constant Stride 


B G O R B G 

10 19 28 1 11 20 

Figure 6b. Contiguous 
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The nodes within the box are unknowns the (2,2) task is to compute. The nodes around the box are neighbor task 
unknowns that must be known at particular times in the iteration; hence, they must be stored 

In Figure 6a, we number all nodes consecutively from 1 to 36, wrapping around from row to row In gen- 
eral, if we have tr rows and tc columns of nodes (including neighbor border nodes) we will store u. as a tr*tc linear 
vector. Notice that by including the neighbor task’s nodes in this storage scheme, nodes of the same color are a 
constant stride apart in memory (the stride is nc the number of colors). This means on a vector computer that 
allows a vector to be storage locations a constant stride apart, we may vectorize the updating of nodes of a given 
color within a task This would require the coefficients of A for this task to be stored as a two dimensional array, 
A (tr* tc , ns -tl) where ns+ 1 represents the 9 stencil locations, the last one being for the diagonal element Of 
course, at the end of updating a given color, we must reset the border nodes back to their original value since they 
were updated only to permit the computation to be vectorized. A control feature like that of the CYBER 205 can 
be used to prohibit these nodes from actually being updated. Notice that the nodes of the same color are also a 
constant stride apart m the vertical direction, namely, a stride of 2 *tr. Therefore, if a control feature is not avail- 
able, the resetting of values can be done by vector operations, since the nodes that require resetting form a vector 
Because nodes of the same color are a constant stride apart m both directions, updated values may be packaged as 
a vector to be sent to appropriate neighbor tasks and m turn the neighbor tasks can store the values as vectors 

Figure 6b shows the storage of unknowns for vector computers that require a vector to be a continuous set of 
storage locations Notice all R’s are numbered first, followed by all B’s, etc, since unknowns must be processed 
the same way they are stored The matrix A must be similarly organized Note that m the vertical direction, 
nodes of the same color are not contiguous Therefore, we would expect to have more overhead in packaging 
values to be communicated to neighbors 

At this point is it worth commenting that the coloring in Figure 2c does not permit vectonzation at the task 
level for either constant stride or contiguous vectors So, if some mathematical rule had selected the coloring of 
Figure 2c, we might backtrack and settle for other mathematical properties m order to achieve more parallelism 
There is always this tradeoff. 
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4. A PISCES Program for the R/B/G/O SOR Algorithm 

PISCES (Parallel Implementation of Scientific Computing Environments), see Pratt [1985], is a high-level pro- 
gramming environment for a carefully specified virtual parallel computer. It may be thought of as a parallel pro- 
gramming language (in the semantic sense) for the PISCES virtual computer. Programs written in PISCES FOR- 
TRAN may be run efficiently on various actual parallel computers if there is an efficient match between the virtual 
and real computers. This is analogous to saying that programs written in FORTRAN may be run efficiently on 
many sequential computers since the FORTRAN virtual computer may be efficiently mapped onto many actual 
sequential machines. For our purposes, the PISCES virtual computer views parallel computing the following way 
(For a precise definition, see Pratt[1985]). 

(1) Tasks are complete programs There is no shared data between tasks. 

(2) Tasks can initiate other tasks. The initiator is the parent. Tasks communicate to other tasks by message 
passing. Only a task knows its own id number. A task can only communicate to another task if the id 
number of the other task is known; hence, a task must give its id number to other tasks if messages are to be 
received from these tasks 

(3) Tasks accept messages by the message-type, not by any properties of the message sender A task may only 
accept a message of a given type if the task has a handler (a special subroutine) capable of processing mes- 
sages of this type. 

(4) Tasks and their handlers can run on different processors, but they must communicate through shared 
memory via the FORTRAN Common block 

(5) Fmer grains of parallelism can be obtained within a task Statement level parallelism is obtained via the 
parbegm-parend construct; parallelism across instances of a do loop is obtained with the pardo-contmue con- 
struct, and parallelism at the operation level is obtained via vector operations. 

We now describe the PISCES program of Figure 7 for the SOR algorithm that uses the ordering of Figure 4, 
the grouping of Figure 5, and the storage scheme of Figure 6a. The program has two types of tasks, a parent task 
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and an update task for each group of nodes in Figure 5. 

The first job of the parent task is to initiate all the ntr*ntc update tasks, each with its own data, its logical 
identification number (i j) and its starting and ending rows (is and ie), and its starting and ending columns (js and 
je ) of nodes. These rows and columns include the border nodes from neighbor update tasks as shown in Figure 6. 
These initiations may be done in parallel if enough processors are available on the actual hardware; otherwise the 
pardol and pardo2 loops can be changed or interpreted as ordinary sequential do loops 

Since the true physical identification numbers of the update tasks are not known until these tasks begin exe- 
cution, and since these tasks will need to communicate with each other, they must exchange ids. An easy way to 
accomplish this is to send their ids to the parent task and m turn the parent task collects the ids and then broad- 
casts the result back to all the update tasks. Now, the parent accepts the ids as ntr*ntc messages of type uientity 
The handler identity reads the data that is sent mto the parameter list like a read statement. Thus, the ids sent are 
stored by this handler m the array tidset (of type taslad) which is m memory common with the parent task The 
parent waits until all ids have been stored in the array tidset and then sends a message of type ad to all the update 
tasks for their future use 

The parent task then enters the SOR iteration loop and waits for ntr*ntc total local convergence messages, 
either type converge or type noconv, which are sent each iteration by the update tasks, to be accepted by the 
handler noconv or to be counted as a signal if the type is converge (Signals are special messages that require no 
processing by a handler.) These two types of messages can be accepted simultaneously since they are within the 
same accept statement, that is, the noconv handler could be running on a different processor than the task that is 
accepting signals If the variable global is true at the end of an iteration, the parent task exits the iteration loop 
and waits for the answers that are sent by each update task to be accepted by the answers handler The answers 
which are now stored m the u array are written to a file and the parent task terminates 

We note that all instances of a given handler can execute in parallel For example, all ntrxntc identity mes- 
sages can be received at once, provided enough processors are available There is always the cost consideration of 
whether obtaining all the parallelism inherent in the algorithm actually saves time The point here is that this 
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parallelism is expressable and parameter studies could be done to determine for what machines it is worthwhile and 
the implementation of PISCES adjusted accordingly. 

Next, we turn to the description of an update task. This task first sends its physical id (pppslf) back to the 
parent task m a message of type identity and then waits until handler tid has accepted from the parent and stored 
the task ids of all other update tasks. An update task only needs to communicate to nhbs other update tasks 
whose logical ids are stored in neigh (nhbs, 2). Notice that the task ids of all the update tasks are stored in tashd 
even though not all will be used. This is because it is probably less time consuming for the parent task to send all 
the ids rather than to package and send individually to each task just what is needed. 

The update task now begins the SOR iteration loop 100, and the loop 200 over the number of colors, nc 
Each time through the 200 loop, the task processes all nodes of the same color assuming the storage arrangement 
in Figure 6a. The notation temp(sfjp-/ic)=0 means that locations of the linear array temp beginning in location st, 
ending m location sp with stride nc are set to zero as a vector operation. Likewise, A(stpp/ic ,nst) means all rows 
starting with row st ending with row sp, incrementing by stride nc, with column nst fixed are processed as a vec- 
tor Note that the arithmetic in the algorithm within a task vectorizes in statements 201-203, and a vector dot pro- 
duct is done in statement 204 

After the values of a given color have been updated, the border values of this color are sent to appropriate 
neighbors This is done for each neighbor in parallel with the pardo 300 statement. The update task retrieves the 
package to be sent as a vector with starting location st(ne), stopping location sp(ne), and stride tnc(ne) and the 
neighbor’s handler stores the package as a vector with starting location nest(ne), stopping location nsp(ne), and 
stride mnc(ne ) Of particular note is the type of message sent; namely, ncolors.uvals. There are nc different u 

value handlers, one to deal with messages of each color from all the update tasks. Only instances of handlers for a 
given color can work in parallel since a task must accept all R’s before the B’s can be accepted for the next itera- 
tion The reason for having nc different handlers to accept the u values instead of only one is not for parallelism, 
but rather to ensure correctness and simplicity of the program. That is, if all messages were of type uvals and a 
neighbor update task operates faster and sends a message with B values, this message could be counted m the 
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total. In this case, all R values would not be accepted before the update task begins to compute B values. That is, 
we can not be guaranteed that messages arrive in the same order they are sent, but by accepting only messages of 
the type ncolors uvals we are guaranteed that all our total messages are R ones on the first pass through the 200 
loop. Of course, we could add a data field to indicate color to messages of type uvals This would require more 
processing to examine the messages and more storage to save messages of the wrong color until they are needed, 
thereby adding to the complexity of the program. 

After all colors have been processed, the local error is determined as a dot product er T er m statement 204 
The task then sends a message of type converge or noconv to the parent and awaits the decision. The parent sends 
this decision as a message of type conyflag which handler convflag records in the variable global. If converged, the 
update task sends its tr*tc u values back to the parent with a message of type answers ; otherwise, the next iteration 
begins 

These programs show the levels of parallelism that are achievable with the R/B/G/O ordering, the grouping 
of unknowns, and the data storage scheme within a task that were given in Figures 4, 5, and 6a respectively. This 
parallelism can be broken mto parallel arithmetic and parallel communication as summarized below 

Parallel Arithmetic 

(1) Nodes in different tasks update simultaneously 

(2) Nodes of the same color within a task can execute simultaneously, probably as vector operations 
Parallel Communication 

(1) All update tasks can be initiated simultaneously 

(2) The parent accepts the task ids in parallel from the update tasks. The update tasks can then accept the 
broadcasted result simultaneously 

(3) The parent accepts convergence information from the update tasks m parallel. The broadcasted convergence 
decision may be accepted in parallel by the tasks. 
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(4) Updated values can be sent to neighbors simultaneously. 

(5) Values of the same color may be accepted from all neighbors simultaneously. 

(6) Answers said back in parallel from the update tasks can be accepted in parallel by the parent. 

However, the algorithm points out where sequential operation seems to prevail. 

Sequential Arithmetic 

(1) The dot product is partially sequential and can be more costly than vector multiplication, for example, on 
vector machines. The dot products are done in parallel across tasks. 

(2) Starting and stopping indices must be computed within each task, but are done in parallel across tasks. 

5. Possible Architectures for Implementation 

The PISCES program that was described in the last section shows that the R/B/G/O SOR algorithm can be 
run on different parallel computer architectures. We conclude by mentioning four types of architectures and the 
proper interpretation of the tasks and handlers for these architectures. 

(1) If only one task is defined and the vector operations are converted to do loops, no handlers are needed and 
the result is the sequential machine algorithm. 

(2) If only one task is defined and the vector operations remain, no handlers are needed and the result is a con- 
stant stride vector program that could run on the CRAY, for example. (Syntax would need repairing, but 
the semantics are the same) 

(3) Assume a task and its handlers run on the same processor and that we have many tasks and many processors. 
Also assume that each processor is a scalar processor and that the processors do not have shared memory, 
but are connected by some interconnection network. Then, if the interconnection network is that of the 
hypercube, for example, our algorithm will run on Caltech’s Hypercube. If instead, the interconnection net- 
work is that of a flat 8 nearest neighbor connection, the algorithm will run on NASA Langley’s Finite Ele- 


ment Machine. 
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(4) If we relax the requirement that tasks can not have shared memory (deviating from PISCES), and let several 
tasks run on the same processor in multiple instruction stream fashion (pipelined MIMD), we see that our 
algonthms can be run on the Denelcor HEP with one or multiple PEMs The accept statement would be 
analogous to synchronization through a shared variable; since, for example, a task would only need to wait 
until R values were updated by all tasks (HEP proceses) before the updating of B values is begun 

(5) To actually realize all the levels of parallelism in our algorithm, we will have to wait until novel architectures 
comprised of many processors, are available Some of the processors would be vector processors for the 
update tasks while others would be scalar processors for the handlers and the parent task Since the proces- 
sor for an update task and the processors for the the update task’s handlers require shared data they could be 
connected to a shared memory. A communication network might be used to connect different update 
processor/handler processor clusters 
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tasktype parent 

parameter (ntr* 9 ntc* 9 n* 9 maxi ter* , <other parameters) 

common /withidentity/ tidset 

common /withnoconv/ global 

common /withanawers/ u 

real u(n 9 n) 9 <other real declarations> 

integer < integer declarations> 

taskid tidset (ntr 9 ntc) 

handler noconv 9 answers 9 identity 

signal converge 

end declarations 

♦initiate the ntr#ntc update tasks 
pardo 1 i*1 9 ntr 

pardo 2 j*1 9 ntc 

is(i 9 j)- 9 ie(i 9 j). 

ja(i 9 j)- 9 js(i,j)- 

on same initiate updated, j 9 is(i 9 j) 9 ie(i, j) 9 js(i 9 j) 9 je(i 9 j) 9 <other data>) 

2 continue 

1 continue 

♦accept the task ids of the tasks and send results back to all update tasks 
accept (ntrantc) 
identity 
end 

to all send tid( ( (tidset (i 9 j ) t i*1 9 ntr) 9 j*1 9 ntc) ) 

♦begin the iteration 

do 100 iter* 1 9 maxi ter 
global* . true 
accept ( ntr ♦ntc ) of 
converge 
noconv 
end accept 

to all send convf lag (global) 
if (global) then go to 3 

100 continue 

♦accept the answers from all the tasks 

3 accept ( ntr^ntc ) 

answers 
end accept 

♦write the answers to file 6 
pardo 4 i«1,n 

write(6 9 ^) (u(i 9 j ) 9 j*1 9 n) 

4 continue 

stop 

end 


Figure 7. PISCES Program (continued on next 3 pages) 
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handler noconv 

coupo n /withnoconv/ global 

logical global 

end declarations 

global-. false 

return 

end 

handler Identity (it, jt,tidset(it, jt) ) 

parameter (ntr» 9 ntc* ) 

coupon /withidentity/ tidset 

integer it 9 jt 

taskid tidset (ntr,ntc) 

end declarations 

return 

end 

handler answers (is f ie 9 js 9 je 9 ( (u(i 9 j) 9 i»is 9 ie) 9 j»js 9 je) ) 

parameter (n» ) 

common /withanswers/ u 

integer is 9 ie 9 js 9 je 

real u(n 9 n) 

end declarations 

return 

end 
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tasktype update (it, jt,is,ie, js f je,< task's other data>) 

parameter (ntr* 9 ntc* ,tr* ,tc* 9 nc* ,w» 9 ns* 9 nbhs* , maxi ter* ,< task's other parameters ) 
common /withtid/ tideet 
commo n /withconvflag/ global 
common /withuvale/ u 

integer it 9 jt 9 is 9 js <other integer declaration8> 

real u(tr«tc) 9 b(tr«tc) 9 A(tr*tc 9 ns+1 ) t stencil (ne) 9 neigh(nhbe 9 2) 9 <othere> 
logical global 
taekid tideet ( ntr 9 ntc ) 

signal done 9 converge 

handler tid 9 convflag 9 (k.uvals 9 k*1 9 nc) 
end declarations 

♦ retrieve task ids of other update tasks 

to parent send identity ( it 9 jt 9 pppslf ) 
accept 
tid 

end accept 

♦begin the iteration 

do 100 iter* 1 9 maxi ter 
do 200 ncolor8*1 9 nc 

st* 9 sp* 

201 temp(st;sp;nc)*0 
do 202 nst*1 9 ns 

tempi st ; sp;nc ) *temp(st ;sp;nc)+A(st ; sp ; nc t ns t)*u(st+ stencil (net) ; sp+ stencil (net) ;nc 

202 continue 

<reset temp to be u at the boundary nodes> 
temp(st;sp;nc)*(b(st;sp;nc)-temp(st;sp;nc) )/A(st;sp;nc 9 ns+1 ) 
er(st;sp;nc)*temp(st;sp;nc)-u(st;sp;nc) 

203 u(st;sp;nc)*v*temp(st;sp;nc) + ( 1-w)*temp(st;sp;nc) 

♦send appropriate u values of this color to all neighbors 
pardo 300 ne*1 9 nbhs 
num(ne)* 

if (num(ne) .ne.0)then 

st(ne)* 9 sp(ne)» , inc ( ne ) ■ 
nest(ne)* t nsp(ne) 9 ninc(ne)* 

to tidset( neigh (ne 9 1 )+it 9 neigh(ne 9 2)+ jt) send ncolors.uvals 
(nest(ne) 9 nsp(ne) 9 ninc(ne) 9 (u(i) 9 i*st(ne) 9 sp(ne) 9 inc(ne) ) ) 
end if 

300 continue 

♦accept (total) messages of type ncolors.uvals from any neighbor 
total* 

accept ( total ) 
ncolors.uvals 
end accept 


200 


continue 



16 


♦check for local and then global convergence 
e»dot(er»er) 
if (e.lt.eps) then 

to parent send converge 

else 


to parent send noconv 

end if 


♦accept convergence decision from parent 
accept 

con vf lag 
end accept 

♦if converged then exit iteration loop 
if (global) then go to 5 

100 continue 

♦send answers back to parent 

5 to parent send answers (is,ie, js, je, (u(i) ,i»1 ,tr^tc) ) 
stop 
end 


handler tid ( ( (tidset(i p j) ,i»1 f ntr) , j»1 p ntc) ) 

parameter (ntr» p ntr» ) 

common /withtid/ tidset 

taskid tidset (ntr # ntc) 

end declarations 

return 

end 

handler convf lag (global) 

common /withconvflag/ global 

logical global 

end declarations 

return 

end 

handler k.uvals (st,sp,inc, (u(i) ,i»st, sp f inc) ) 

parameter (tr» »tc» p nc» ) 

common /withuvals/ u 

integer st,sp,inc 

real u(tr,tc) 

return 

end 
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